@HyukjinKwon (Member)
### What changes were proposed in this pull request?

This PR fixes the regression introduced by #36683.

```python
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 0)
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", False)
spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()

spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", -1)
spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()
```

**Before**

```
/.../spark/python/pyspark/sql/pandas/conversion.py:371: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.pyspark.fallback.enabled' has been set to false.
  range() arg 3 must not be zero
  warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 1483, in createDataFrame
    return super(SparkSession, self).createDataFrame(  # type: ignore[call-overload]
  File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 351, in createDataFrame
    return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 633, in _create_from_pandas_with_arrow
    pdf_slices = (pdf.iloc[start : start + step] for start in range(0, len(pdf), step))
ValueError: range() arg 3 must not be zero
```
```
Empty DataFrame
Columns: [a]
Index: []
```

**After**

```
     a
0  123
```
```
     a
0  123
```
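The traceback shows the root cause: `_create_from_pandas_with_arrow` passes the configured batch size straight into `range(0, len(pdf), step)`, so a zero value raises a `ValueError` and a negative value silently yields no slices (the empty DataFrame above). A minimal sketch of the kind of guard that restores the documented behaviour (non-positive means "no limit", i.e. one batch); `max_records_per_batch` is an illustrative stand-in for the config value, not the exact patch:

```python
import pandas as pd

# Illustrative stand-in for spark.sql.execution.arrow.maxRecordsPerBatch;
# 0 and -1 are the regressed cases.
max_records_per_batch = 0

pdf = pd.DataFrame({"a": [123]})

step = max_records_per_batch
if step <= 0:
    # Non-positive means "no limit": put the whole DataFrame in one slice.
    # max(..., 1) also keeps range() from getting a zero step when pdf is empty.
    step = max(len(pdf), 1)

pdf_slices = [pdf.iloc[start : start + step] for start in range(0, len(pdf), step)]
print(len(pdf_slices))  # 1
```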

### Why are the changes needed?

It fixes a regression: non-positive values of `spark.sql.execution.arrow.maxRecordsPerBatch` are documented behaviour and must keep working. The fix should be backported to branch-3.4 and branch-3.5.

### Does this PR introduce any user-facing change?

Yes, it fixes a regression as described above.

### How was this patch tested?

A unit test was added.
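For reference, a regression test covering both values could look like the sketch below; the function name and the `spark` session fixture are illustrative, not the test that was actually merged:

```python
import pandas as pd

def test_create_dataframe_nonpositive_max_records_per_batch(spark):
    # `spark` is assumed to be an active SparkSession (e.g. a test fixture).
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
    for value in (0, -1):
        spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", value)
        # Both values must round-trip through Arrow without raising.
        result = spark.createDataFrame(pd.DataFrame({"a": [123]})).toPandas()
        assert result["a"].tolist() == [123]
```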

### Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon HyukjinKwon changed the title Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch Feb 16, 2024
@dongjoon-hyun (Member) left a comment
+1, LGTM (Pending CIs)

@HyukjinKwon (Member, Author)
Merged to master, branch-3.5 and branch-3.4.

HyukjinKwon added a commit that referenced this pull request Feb 16, 2024

[SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch

Closes #45132 from HyukjinKwon/SPARK-47068.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 3bb762d)
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Feb 16, 2024

[SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch

Closes #45132 from HyukjinKwon/SPARK-47068.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 3bb762d)
Signed-off-by: Hyukjin Kwon <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Mar 26, 2024

[SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch

Closes apache#45132 from HyukjinKwon/SPARK-47068.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 3bb762d)
Signed-off-by: Hyukjin Kwon <[email protected]>